A Clustering Rule Based Approach for Classification Problems

Authors

  • Philicity Williams
  • Caio Soares
  • Juan E. Gilbert
Abstract

Predictive models, such as rule based classifiers, often have difficulty with incomplete data (e.g., erroneous or missing values). This work presents a technique that uses divisive data clustering to reduce the severity of the effects of missing data on the performance of rule based classifiers. The Clustering Rule based Approach (CRA) clusters the original training data and builds a separate rule based model on each cluster's data. The individual models are combined into a larger model and evaluated against test data. The effects of missing attribute information on ordered and unordered rule sets are evaluated, and experiments show that the performance of the collective model (CRA) is less affected than that of the traditional model when the test data has missing attribute values, making it more resilient and robust to missing data.

DOI: 10.4018/jdwm.2012010101. International Journal of Data Warehousing and Mining, 8(1), 1-23, January-March 2012. Copyright © 2012, IGI Global.

Rule based classifiers are widely used because the models, or sets of rules, they generate are easy to interpret. These classifiers perform exceptionally well on complete data sets, that is, data that is clean, correct, and free of missing attribute values. To generate the model, the training process uses attributes and their values relative to each other to segregate the data and generate a rule for a particular class. This poses a problem with incomplete data, because the models produced by traditional rule based classifiers are sensitive to missing attribute values in new or unseen data. Consider, for example, the data set in Table 1 and the resulting ID3 decision tree in Figure 1. The rules derived from this tree are:

  • If Highest Ed is Bachelor and Visa Req is Foreign, then don't hire.
  • If Highest Ed is Bachelor and Visa Req is US Citizen, then hire.
  • If Highest Ed is Doctorate, then hire.
  • If Highest Ed is Master and Entry Level is Yes, then hire.
  • If Highest Ed is Master and Entry Level is No, then don't hire.

Notice that all of the rules contain the Highest Ed attribute. Suppose there is some test data in which the Highest Ed information ...

Table 1. Hiring training data

ID     Highest Ed   Entry Level   Visa Req     Hired
1000   Doctorate    No            US Citizen   Yes
1001   Bachelor     Yes           Foreign      No
1002   Master       No            Foreign      Yes
1003   Master       No            Foreign      No
1004   Bachelor     No            Foreign      Yes
1005   Bachelor     Yes           US Citizen   No
1006   Doctorate    Yes           US Citizen   Yes
1007   Master       No            US Citizen   No
1008   Master       Yes           US Citizen   No
1009   Bachelor     Yes           US Citizen   Yes

Figure 1. ID3 Decision tree
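
The five hiring rules listed in the text can be encoded directly to illustrate why a missing Highest Ed value is a problem for a traditional rule set. The following is a minimal sketch, not the authors' implementation: the dictionary representation, the first-match strategy, the `classify` helper, and the use of `None` for both a missing value and "no rule fired" are all assumptions made for illustration.

```python
from typing import Optional

# The five rules quoted in the text, read off the ID3 tree.
# Assumed representation: (conditions, decision); a rule fires only
# when every one of its conditions matches the record.
RULES = [
    ({"Highest Ed": "Bachelor", "Visa Req": "Foreign"}, "don't hire"),
    ({"Highest Ed": "Bachelor", "Visa Req": "US Citizen"}, "hire"),
    ({"Highest Ed": "Doctorate"}, "hire"),
    ({"Highest Ed": "Master", "Entry Level": "Yes"}, "hire"),
    ({"Highest Ed": "Master", "Entry Level": "No"}, "don't hire"),
]


def classify(record: dict, rules=RULES) -> Optional[str]:
    """Return the decision of the first matching rule, or None if no rule fires."""
    for conditions, decision in rules:
        # A missing attribute (absent or None) never equals a rule value.
        if all(record.get(attr) == value for attr, value in conditions.items()):
            return decision
    return None


# Hypothetical test records, not taken from Table 1.
complete = {"Highest Ed": "Master", "Entry Level": "Yes", "Visa Req": "US Citizen"}
missing = {"Highest Ed": None, "Entry Level": "Yes", "Visa Req": "US Citizen"}

print(classify(complete))  # hire
print(classify(missing))   # None
```

Because every rule tests Highest Ed, the record with that attribute missing matches no rule at all, even though its other two attributes are known.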


Similar resources

Proposing a Novel Cost Sensitive Imbalanced Classification Method based on Hybrid of New Fuzzy Cost Assigning Approaches, Fuzzy Clustering and Evolutionary Algorithms

In this paper, a new hybrid methodology is introduced to design a cost-sensitive fuzzy rule-based classification system. A novel cost metric is proposed based on the combination of three different concepts: Entropy, Gini index and DKM criterion. In order to calculate the effective cost of patterns, a hybrid of fuzzy c-means clustering and particle swarm optimization algorithm is utilized. This ...


Proposing a New Method Based on Genetic Programming for Weighting Fuzzy Rules in Imbalanced Classification

In classification problems, we often encounter data sets whose classes hold very different percentages of patterns (i.e., some classes have a high pattern percentage and others a low one). These problems are called "classification problems with imbalanced data sets". Fuzzy rule based classification systems are the most popular fuzzy modeling systems used in pattern classification problems. Rule weights...


Oil Reservoirs Classification Using Fuzzy Clustering (RESEARCH NOTE)

Enhanced Oil Recovery (EOR) is a well-known method to increase oil production from oil reservoirs. Applying EOR to a new reservoir is a costly and time consuming process. Incorporating available knowledge of oil reservoirs in the EOR process eliminates these costs and saves operational time and work. This work presents a universal method to apply EOR to reservoirs based on the available data by...


Optimum Ensemble Classification for Fully Polarimetric SAR Data Using Global-Local Classification Approach

In this paper, a proposed ensemble classification for fully polarimetric synthetic aperture radar (PolSAR) data using a global-local classification approach is presented. In the first step, to perform the global classification, the training feature space is divided into a specified number of clusters. In the next step to carry out the local classification over each of these clusters, which cont...


Optimal Feature Selection for Data Classification and Clustering: Techniques and Guidelines

In this paper, principles and existing feature selection methods for classifying and clustering data are introduced. To that end, categorizing frameworks for finding selected subsets, namely search-based and non-search-based procedures, as well as evaluation criteria and data mining tasks are discussed. In the following, a platform is developed as an intermediate step toward developing an intell...


A QUADRATIC MARGIN-BASED MODEL FOR WEIGHTING FUZZY CLASSIFICATION RULES INSPIRED BY SUPPORT VECTOR MACHINES

Recently, tuning the weights of the rules in Fuzzy Rule-Based Classification Systems has been researched in order to improve classification accuracy. In this paper, a margin-based optimization model, inspired by Support Vector Machine classifiers, is proposed to compute these fuzzy rule weights. This approach not only considers both accuracy and generalization criteria in a single objective fu...



Journal title:
  • IJDWM

Volume 8, Issue 1

Pages 1-23

Publication date: 2012